1 Introduction

This report investigates how models behave for the hyperparameter of k in knn. Furthermore a evaluation of the bias and the residuals over time will be performed. These topics will be answered based on the posed questions posed in the report exercise.

2 Setup and Data cleaning

First, we load all data and we perform data cleaning and quality controls. These steps were specified in the excercise and so will not be discussed further.

set.seed(1982)



# Package names
packages <- c("lubridate","tidyverse","visdat","yardstick","purrr","reshape2","rsample","recipes","caret","zoo")
source("../R/load_packages.R")
load_packages(packages)
## [1] "All packages installed and loaded"
daily_fluxes <- read_csv("../data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") |>  
  
  # select only the variables we are interested in
  dplyr::select(TIMESTAMP,
                GPP_NT_VUT_REF,    # the target
                ends_with("_QC"),  # quality control info
                ends_with("_F"),   # includes all all meteorological co-variates
                -contains("JSB")   # weird useless variable
                ) |>

  # convert to a nice date object
  dplyr::mutate(TIMESTAMP = ymd(TIMESTAMP)) |>

  # set all -9999 to NA
  dplyr::mutate(across(where(is.numeric), ~na_if(., -9999))) |> 
  
  # retain only data based on >=80% good-quality measurements
  # overwrite bad data with NA (not dropping rows)
  dplyr::mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.8, NA, GPP_NT_VUT_REF),
                TA_F           = ifelse(TA_F_QC        < 0.8, NA, TA_F),
                SW_IN_F        = ifelse(SW_IN_F_QC     < 0.8, NA, SW_IN_F),
                LW_IN_F        = ifelse(LW_IN_F_QC     < 0.8, NA, LW_IN_F),
                VPD_F          = ifelse(VPD_F_QC       < 0.8, NA, VPD_F),
                PA_F           = ifelse(PA_F_QC        < 0.8, NA, PA_F),
                P_F            = ifelse(P_F_QC         < 0.8, NA, P_F),
                WS_F           = ifelse(WS_F_QC        < 0.8, NA, WS_F)) |> 

  # drop QC variables (no longer needed)
  dplyr::select(-ends_with("_QC"))

vis_miss(
  daily_fluxes,
  cluster = FALSE, 
  warn_large_data = FALSE
  )

We see that sadly, a lot of data is missing from the long wave infrared radiation. However since this data cleaning step is required as described in the exercise, models will be applied to this dataset version.

3 Data Evaluation

3.1 Linear Model

First we plot the performance of the lm model:

# linear regression model
source("../R/model_daily_fluxes.R") #the model function, to check parameters refer to the Function itself
source("../R/split_dataset.R")
source("../R/Evaluate_model_function.R") #This function takes many arguments, these will be elaborated step by step.

  split <- split_df(daily_fluxes,0.7)
  temp_lm<- model_daily_fluxes(split$train,split$test,method_loc = "lm") #returns the model with "lm" and a split proportion of 0.7, as well as the datasets of the splits. The return is a named list.
  print(eval_model(mod = temp_lm, df_train = split$train, df_test = split$test)) #The function creates an evaluation of the model based on the given datasets of training and evaluation. Naming can be specified with parameters as well as other outputs, these will be used later on.

3.2 KNN model

Now we plot the performance of the KNN model.

  temp_KNN<- model_daily_fluxes(split$train,split$test,method_loc = "knn",kofKNN = 8)

  print(eval_model(mod = temp_KNN, df_train = split$train, df_test = split$test)) 

4 Interpretation

4.0.0.1 Why is the difference between the evaluation on the training and the test set larger for the KNN model than for the linear regression model? Why does the evaluation on the test set indicate a better model performance of the KNN model than the linear regression model?

Because regression predicts what the connections are between variables for all variables (global) and so for both datasets behaves similarly because there is a lot of data in the training part that is all fitted with one line and so the regression line will miss also points of the training data (maybe underfit), while nearest neighbour calculates a local regression so it fits training data very well (somehow maybe overfitting?) but may not fit new data as well. Here this is not the case and the knn model appreas to outperform the lm model even in the test dataset and so is probably the better fit (RMSE is smaller). This is probably because the knn model captures local trends better due to the aforementioned variance.

4.0.0.2 How would you position the KNN and the linear regression model along the spectrum of the bias-variance trade-off?

Lm almost no variance and per definition no/very small bias. So this is not a very optimal case of a model.

KNN can have some bias and a balanced variance. This gives a nicer trade-off if neighbors (k) are not to few or too many.

4.0.0.3 Visualise temporal variations of observed and modelled GPP for both models, covering all available dates:

  print("Model knn")
## [1] "Model knn"
  temp_KNN<- model_daily_fluxes(split$train,split$test,method_loc = "knn",kofKNN = 8)

  print(eval_model(mod = temp_KNN, df_train = split$train, df_test = split$test,out = "plot_all")) 
## Warning: Removed 49 rows containing missing values (`geom_line()`).

print("Model lm")
## [1] "Model lm"
  temp_lm<- model_daily_fluxes(split$train,split$test,method_loc = "lm") #returns the model with "lm" and a split proportion of 0.7, as well as the datasets of the splits. The return is a named list.
  print(eval_model(mod = temp_lm, df_train = split$train, df_test = split$test,out = "plot_all"))
## Warning: Removed 49 rows containing missing values (`geom_line()`).

Not much variation in the residuals over time, residuals were used to show the variations of observed and modeled GPP. If there is a hiatus of data around January of 2008 and a the decrease arround 2011, this is visible in all models with all data. Of course, the bias of the lm in the training is 0 (definition of lm of trainingdata).

4.0.0.4 state a hypothesis for how the R2 and the MAE evaluated on the test and on the training set would change for k approaching 1 and for k approaching N (the number of observations in the data). Explain your hypothesis, referring to the bias-variance trade-off.

If k = 1, training data will be predicted with 0 error and perfect R2, but the test set performance will be very poor with a low R2 and a high MAE. If k approaches n, the training data will show a not optimal R2 and MAE because it is underfitted, and the test data will also not have a optimal fit since variation is not captured as desired. Therefore the ideal case must be somewhere in between.

4.0.0.5 Put your hypothesis to the test! Write code that splits the data into a training and a test set and repeats model fitting and evaluation for different values for k. Visualise results, showing model generalisability as a function of model complexity. Describe how a “region” of overfitting and underfitting can be determined in your visualisation. Write (some of your) code into a function that takes k as an input and and returns the MAE determined on the test set.

for (k in c(1,10,50,200)) {
  temp_KNN <- model_daily_fluxes(split$train,split$test,method_loc = "knn",kofKNN = k)
  print(paste("Model k:",k))
  print(eval_model(mod = temp_KNN, df_train = split$train, df_test = split$test, out = "plot")) 
  print(eval_model(mod = temp_KNN, df_train = split$train, df_test = split$test, out = "mae")) 
}
## [1] "Model k: 1"

## $mae_train
## [1] 0
## 
## $mae_test
## [1] 1.475588
## 
## [1] "Model k: 10"

## $mae_train
## [1] 1.054789
## 
## $mae_test
## [1] 1.152966
## 
## [1] "Model k: 50"

## $mae_train
## [1] 1.133233
## 
## $mae_test
## [1] 1.142134
## 
## [1] "Model k: 200"

## $mae_train
## [1] 1.192228
## 
## $mae_test
## [1] 1.195694

The aforementioned things are visible in these plots: with k = 1 the training dataset performes very well but the test dataset performs poorly with a high mae. As of for the other extreme, the model also appears to slightly decrease the R2 value as well as increase the MAE when the k becomes extremely high. This is due to underfitting with to many neighbors chosen. The model then doesnt capture local trends anymore.

4.0.0.6 Is there an “optimal” k in terms of model generalisability? Edit your code to determine an optimal k

For the “optimal” i would argue based on my current knowledge that we want the best performance possible in the MAE in the test dataset. We can loop over several possibilities of k and then do an evaluation:

mae_test <-  NULL #Vector to save the MAE values
values_k <- c(1,seq(5,to = 100, by = 5),125,150) #values to be tested
for (k in values_k) {
  
    temp_KNN <- model_daily_fluxes(split$train,split$test,method_loc = "knn",kofKNN = k)
  mae_test <- c(mae_test, eval_model(mod = temp_KNN, df_train = split$train, df_test = split$test, out = "mae")$mae_test)
  #a now output is desired: the R2 values. Therefore the model itself returns the R2 value of the test and training datasets as a list. There, only the test R2 is used.
  
} #This takes some time to load since a lot of models are evaluated.

mae_test <- tibble(k = values_k,mae_test = mae_test) #save values as a nice tibble


ggplot(data = mae_test,aes(x = k, y = mae_test))+
  geom_point()+geom_line()+theme_classic()+
  labs(x = "k of KNN",y = "MAE")#plot the data

mae_test |>
  filter(mae_test == min(mae_test)) #print the best performing model.
## # A tibble: 1 × 2
##       k mae_test
##   <dbl>    <dbl>
## 1    15     1.13

As predicted in our less elaborate approach, the MAE value of the test data increases at first (with some random fluctuations) until k = 15 when it starts to decrease again. Therefor k = 15 is our “optimal” k. Though this is not optimal. We should never use the test data to optimize the model. In a optimal case we would set aside another 20-30% of the data to perform final model performance.